
pref: Enable mgpu in FrameView #5514

Open
pv-nvidia wants to merge 7 commits into isaac-sim:develop from pv-nvidia:feat/frame-view-enable-mgpu

Conversation

@pv-nvidia

@pv-nvidia pv-nvidia commented May 6, 2026

Description

Removes the cuda:0-only restriction in FabricFrameView. USDRT SelectPrims now accepts any CUDA device index, so Fabric acceleration runs on the simulation device (e.g., cuda:1) instead of silently falling back to the slower USD path. This unblocks distributed training where each process is pinned to a specific GPU.

The user-facing surface change is small (drops the device guard in __init__, the assertion in _initialize_fabric, and the _fabric_supported_devices allowlist). The remainder of the diff comprises:

  • Write-path refactor. set_world_poses, set_scales, and the initial USD→Fabric sync now share a single _compose_fabric_transform helper. The sync collapses to one combined kernel launch with one PrepareForReuse, eliminating the previous "double PrepareForReuse" code smell where a non-idempotent second call could mask a topology-change signal.
  • Robustness. The topology-change invariant in _rebuild_fabric_arrays now raises RuntimeError instead of using assert, so the check survives python -O. Previously, an -O run with a real prim-count mismatch would silently dispatch with a stale _view_to_fabric mapping and produce wrong poses or out-of-bounds kernel indices.
  • Multi-GPU test coverage. Three cuda:1-parameterized tests guarded by a new multi_gpu pytest marker (registered in pyproject.toml), plus a dedicated CI job on the existing multi-GPU runner so regressions show up automatically on PRs that touch FabricFrameView or its test file.
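
The consolidated write path can be sketched as follows. The method names (_compose_fabric_transform, set_world_poses, set_scales, _sync_fabric_from_usd_once) come from this PR; the Fabric calls themselves are stand-in stubs, not the real USDRT API, and serve only to show the "one PrepareForReuse per logical write" guarantee:

```python
class FabricViewSketch:
    """Illustrative stand-in for FabricFrameView's write path."""

    def __init__(self) -> None:
        self.prepare_calls = 0  # instrumentation for this sketch only

    def _prepare_for_reuse(self) -> None:
        # Stand-in for Fabric's PrepareForReuse; must run exactly once
        # per logical write, never twice.
        self.prepare_calls += 1

    def _launch_write_kernel(self, positions, orientations, scales) -> None:
        # Stand-in for the single combined kernel launch.
        self.last_write = (positions, orientations, scales)

    def _compose_fabric_transform(self, positions=None, orientations=None,
                                  scales=None) -> None:
        """Single shared write path: one PrepareForReuse and one kernel
        launch, regardless of which components are being set."""
        self._prepare_for_reuse()
        self._launch_write_kernel(positions, orientations, scales)

    def set_world_poses(self, positions, orientations) -> None:
        self._compose_fabric_transform(positions=positions,
                                       orientations=orientations)

    def set_scales(self, scales) -> None:
        self._compose_fabric_transform(scales=scales)

    def _sync_fabric_from_usd_once(self, positions, orientations, scales) -> None:
        # Previously two writes (the "double PrepareForReuse" code smell);
        # now one combined write.
        self._compose_fabric_transform(positions=positions,
                                       orientations=orientations,
                                       scales=scales)
```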

The skip-vs-fail logic in _skip_if_unavailable is intentional and works in concert with the CI workflow:

  • On a developer workstation with a single GPU, cuda:1 tests pytest.skip so local runs stay green.
  • In CI, the runner is guaranteed to have ≥2 GPUs by a pre-flight nvidia-smi + torch.cuda.device_count() check at the top of test-fabric-multi-gpu. If the runner regresses to a single GPU, the pre-flight emits ::error:: and exits non-zero before pytest is even invoked, so a misconfigured runner cannot silently green-light a PR.

The test helper used to special-case GITHUB_ACTIONS=true and call pytest.fail, but that branch was removed: the workflow pre-flight catches the same failure mode without coupling the test file to os.environ.

Type of change

  • New feature (non-breaking change which adds functionality)

cuda:0 continues to work exactly as before; cuda:1+ now also works instead of silently falling back to USD. No public API surface changed.

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

Note on the changelog item: this PR uses a fragment file at source/isaaclab_physx/changelog.d/feat-frame-view-enable-mgpu.rst per the new fragment-based changelog system (#5434). CHANGELOG.rst and config/extension.toml are intentionally not edited directly — the nightly CI workflow compiles them from fragments. The fragment covers the cuda:N enablement (Fixed), the combined-sync refactor (Changed), and the assert → RuntimeError fix (Fixed).
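
For reference, such a fragment might look like the following. Only the file path and the Fixed/Changed categorization come from this PR; the section markup and wording are assumptions about the fragment format:

```rst
Fixed
^^^^^

* Enabled Fabric acceleration in :class:`FabricFrameView` on any ``cuda:N``
  device instead of silently falling back to the slower USD path.
* Replaced the topology-change ``assert`` with ``RuntimeError`` so the check
  survives ``python -O``.

Changed
^^^^^^^

* Consolidated ``set_world_poses``, ``set_scales``, and the initial
  USD-to-Fabric sync into a single ``_compose_fabric_transform`` helper with
  one ``PrepareForReuse`` per logical write.
```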

Test plan

Three new tests, all marked @pytest.mark.multi_gpu and parameterized with ["cuda:1"]:

  • test_fabric_cuda1_world_pose_roundtrip — set_world_poses → get_world_poses returns the same values on a non-primary CUDA device.
  • test_fabric_cuda1_no_usd_writeback — Fabric writes on cuda:1 do not write back to USD (atol=0.0 — equality, not approximate).
  • test_fabric_cuda1_scales_roundtrip — covers the set_scales write path on cuda:1, since both Fabric write paths now run on self._device.

A new CI job, test-fabric-multi-gpu, runs in .github/workflows/test-multi-gpu.yaml on the existing [self-hosted, linux, x64, gpu, multi-gpu] runner. The job pre-flights with nvidia-smi and ./isaaclab.sh -p -c "import torch; print(torch.cuda.device_count())", and fails loudly with ::error:: if the runner regresses to a single GPU. Triggered automatically on PRs that touch source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py or its test file.
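
The pre-flight gate reduces to a small decision. The real job shells out to nvidia-smi and torch.cuda.device_count(); this sketch takes the count as a parameter so the ::error:: annotation and non-zero exit behavior are visible in isolation (the function name is hypothetical):

```python
def preflight_exit_code(device_count: int, required: int = 2) -> int:
    """Return the exit code the multi-GPU pre-flight step would use."""
    if device_count < required:
        # GitHub Actions surfaces ::error:: annotations on the checks page;
        # a non-zero exit fails the job before pytest is ever invoked.
        print(f"::error::multi-GPU runner regressed: {device_count} GPU(s) "
              f"visible, {required} required")
        return 1
    return 0
```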

To verify locally on a multi-GPU machine:

./isaaclab.sh -p -m pytest -m multi_gpu \
    source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py -v

To verify the cuda:0 path is unchanged:

./isaaclab.sh -p -m pytest -m "not multi_gpu" \
    source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py -v

@github-actions github-actions Bot added the isaac-lab (Related to Isaac Lab team) and infrastructure labels May 6, 2026
@pv-nvidia pv-nvidia marked this pull request as draft May 6, 2026 12:32
@pv-nvidia pv-nvidia self-assigned this May 6, 2026
@greptile-apps
Contributor

greptile-apps Bot commented May 6, 2026

Greptile Summary

This PR lifts the cuda:0-only restriction in FabricFrameView, allowing Fabric GPU acceleration on any CUDA device index. The primary code change is dropping the _fabric_supported_devices allowlist and replacing it with a direct self._device pass-through to SelectPrims.

  • Write-path refactor: set_world_poses, set_scales, and the initial USD→Fabric sync are unified into a single _compose_fabric_transform helper, guaranteeing PrepareForReuse is called exactly once per logical write instead of twice during the initial sync.
  • Robustness fix: The topology-change invariant in _rebuild_fabric_arrays now raises RuntimeError instead of assert, so it survives python -O rather than silently corrupting pose data.
  • Test & CI: Three new @pytest.mark.multi_gpu tests cover the cuda:1 write/read/no-writeback paths; a new test-fabric-multi-gpu CI job with a GPU pre-flight check ensures these tests cannot silently green-light on a regressed single-GPU runner.

Confidence Score: 5/5

Safe to merge; the device-guard removal is well-scoped, the write-path refactor correctly eliminates the double PrepareForReuse, and the CI pre-flight ensures multi-GPU coverage cannot silently regress.

The core change (passing self._device directly to SelectPrims) is minimal and the surrounding refactor is correctness-improving. The only outstanding note is a stale one-line docstring that still advertises the old device restriction; everything else in the diff is either a genuine fix or well-tested new coverage.

The init docstring in fabric_frame_view.py still lists the old 'cpu' or 'cuda:0' restriction and should be updated to match the new behavior.

Important Files Changed

Filename Overview
source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py Removes cuda:0-only restriction, adds _compose_fabric_transform helper, replaces assert with RuntimeError, and drops CPU device guard. Stale docstring still lists "cpu" or "cuda:0" only.
source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py Adds sys.argv stripping for Kit startup safety, improves _skip_if_unavailable for cuda:N devices, and adds three new multi-GPU test cases for cuda:1.
.github/workflows/test-multi-gpu.yaml Adds test-fabric-multi-gpu CI job with a GPU pre-flight check that fails loudly if the runner regresses to a single GPU, preventing silent green runs with zero coverage.
pyproject.toml Registers new multi_gpu pytest marker with a clear description of its skip-on-single-GPU semantics.
source/isaaclab_physx/changelog.d/feat-frame-view-enable-mgpu.rst New fragment-based changelog entry covering the multi-GPU enablement, combined-sync refactor, and assert→RuntimeError fix.


@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch 4 times, most recently from a6cd73e to 2c619fe Compare May 7, 2026 08:44

@isaaclab-review-bot isaaclab-review-bot Bot left a comment


Review Summary

Clean, well-motivated PR that removes an artificial device restriction. The refactoring of set_world_poses and set_scales into a single _compose_fabric_transform helper is a nice improvement — it ensures PrepareForReuse is only called once per logical update regardless of which components are being set.

Strengths

  • Single kernel launch for combined updates — _sync_fabric_from_usd_once now does pos+orient+scale in one pass instead of two separate calls
  • assert → RuntimeError — Topology guard won't be optimized away with python -O
  • Thorough multi-GPU test coverage with proper skip/fail semantics depending on CI vs local
  • CI workflow integration — dedicated test-fabric-multi-gpu job triggered by relevant file changes
  • sys.argv stripping — prevents Kit segfault from pytest flags

Minor Notes

See inline comments.

@pv-nvidia pv-nvidia marked this pull request as ready for review May 11, 2026 11:29

@isaaclab-review-bot isaaclab-review-bot Bot left a comment


🔍 Code Review for PR #5514: Multi-GPU FabricFrameView Enablement

Reviewing changes since last review (merge from develop b3b770d8c3798134)

✅ Overall Assessment

This PR cleanly enables FabricFrameView on non-primary CUDA devices, unblocking distributed training scenarios. The implementation is well-structured with appropriate test coverage.


🎯 Key Findings

1. Excellent Refactoring: Single Write Path

The consolidation of set_world_poses and set_scales into the shared _compose_fabric_transform helper is a significant improvement:

  • Reduces code duplication from ~40 lines to a single method call
  • Ensures PrepareForReuse is invoked exactly once per logical update
  • Properly handles the initial USD→Fabric sync in one atomic operation
def _sync_fabric_from_usd_once(self) -> None:
    # Now combines all components in one write — clean!
    self._compose_fabric_transform(
        positions=positions_usd_ta.warp,
        orientations=orientations_usd_ta.warp,
        scales=scales_usd,
    )

2. Proper Assert→RuntimeError Migration

The change from assert to RuntimeError for topology-change detection is correct and necessary:

if self.count != self._default_view_indices.shape[0]:
    raise RuntimeError(
        f"Prim count changed ({self.count} vs {self._default_view_indices.shape[0]}). "
        "Fabric topology change added/removed tracked prims — full re-initialization required."
    )

This ensures the invariant check survives python -O optimization, preventing silent corruption.
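
The difference is easy to demonstrate in isolation (illustrative guards, not the PR's actual code): under `python -O`, `__debug__` is False and assert statements are compiled out entirely, while an explicit raise always executes.

```python
def guard_with_assert(count: int, expected: int) -> None:
    # This entire statement vanishes under `python -O`.
    assert count == expected, "topology changed"


def guard_with_raise(count: int, expected: int) -> None:
    # Unaffected by -O: the check always runs.
    if count != expected:
        raise RuntimeError(f"Prim count changed ({count} vs {expected}); "
                           "full re-initialization required.")
```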

3. Test Coverage is Thorough

The three new multi-GPU tests cover the critical paths:

  • test_fabric_cuda1_world_pose_roundtrip — validates the primary write path
  • test_fabric_cuda1_no_usd_writeback — confirms Fabric isolation from USD
  • test_fabric_cuda1_scales_roundtrip — covers the secondary write path

The _skip_if_unavailable helper correctly uses pytest.skip (not pytest.fail) for graceful CI handling.

4. CI Configuration is Well-Designed

The new test-fabric-multi-gpu job in .github/workflows/test-multi-gpu.yaml:

  • Targets the correct runner labels: [self-hosted, linux, x64, gpu, multi-gpu]
  • Triggers on changes to the relevant files
  • Uses the multi_gpu marker filter to run only the appropriate tests

💡 Minor Suggestions (Non-Blocking)

  1. Documentation String Enhancement (optional)

    The _compose_fabric_transform docstring could mention that the method handles the _fabric_usd_sync_done flag update, clarifying the state machine:

    def _compose_fabric_transform(...):
        """...
        
        Also sets ``_fabric_usd_sync_done = True`` after successful write,
        indicating that Fabric and USD are now out of sync (Fabric is authoritative).
        """
  2. Changelog Fragment

    The changelog at source/isaaclab_physx/changelog.d/feat-frame-view-enable-mgpu.rst is well-written and accurately describes both the fix and the changed behavior.


🔒 Safety Checklist

  • No breaking changes to public API
  • cuda:0 behavior unchanged
  • Error messages are clear and actionable
  • Tests guard against regressions
  • CI properly configured for multi-GPU validation

📋 Summary

Aspect Status
Code Quality ✅ Clean
Test Coverage ✅ Comprehensive
Documentation ✅ Complete
CI Integration ✅ Proper
Breaking Changes ✅ None

Recommendation: This PR is ready for merge. The latest changes are only merge commits from develop with no substantive code changes since the last review.


Note: This PR depends on #5380 — ensure that PR is merged first.

@pv-nvidia pv-nvidia changed the title Feat/frame view enable mgpu Enable mgpu in FrameView May 12, 2026
@pv-nvidia pv-nvidia changed the title Enable mgpu in FrameView pref: Enable mgpu in FrameView May 12, 2026
@pv-nvidia pv-nvidia added the enhancement New feature or request label May 12, 2026
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch 2 times, most recently from ada2608 to c1254cd Compare May 13, 2026 10:54
@isaaclab-review-bot

🤖 Isaac Lab Review Bot — Follow-up Review (c1254cd)

All previously raised concerns have been addressed

Resolved Issues from Previous Review:

Issue Status
P1 - assert → RuntimeError for topology-change invariant ✅ Fixed
P2 - PrepareForReuse called twice during USD→Fabric sync ✅ Fixed (consolidated into single _compose_fabric_transform call)
sys.argv stripping documentation ✅ Added explicit comment
Missing GPU pre-flight in test-fabric-multi-gpu workflow ✅ Added "Verify multi-GPU availability" step

Summary of Additional Changes:

Version bumps:

  • isaaclab 5.1.0 → 5.1.1
  • isaaclab_newton 0.8.0 → 0.8.1
  • isaaclab_ov 0.1.7 → 0.1.8
  • isaaclab_physx 0.6.3 → 0.6.4
  • isaaclab_tasks 1.5.37 → 1.5.38

Dependency updates:

  • newton v1.2.0rc2 → v1.2.0rc3 (now packaged, not git+)
  • mujoco-warp 3.8.0.1 → 3.8.0.2
  • warp-lang pinned 1.13.0 → >=1.13.0

Bug fixes:

  • Carbonite null-client error in _fabric_notices.py and _cubric.py
  • ContactSensor metadata extraction for Newton 1.1+ (scalar types, per-row indices)
  • Dexsuite: mesh-based spawning for Newton heterogeneous object support

New features:

  • test-fabric-multi-gpu.yaml CI workflow
  • multi_gpu pytest marker

Reviewed commits: ada2608 → c1254cd

pv-nvidia added 7 commits May 13, 2026 21:58
- Allow FabricFrameView to run on cuda:N for any N; USDRT SelectPrims
  no longer needs cuda:0.
- Refactor the Fabric write path into a single _compose_fabric_transform
  helper shared by set_world_poses, set_scales, and the initial
  USD->Fabric sync, collapsing the sync to one kernel launch with one
  PrepareForReuse.
- Replace the topology-invariant assert with RuntimeError so it survives
  python -O.
- Add multi_gpu pytest marker plus cuda:1 unit-test coverage for both
  Fabric write paths, and run them in the existing test-multi-gpu CI
  job (one extra step, no new job).
The standard pytest invocation in CI runs the fabric test file without
filtering on the ``multi_gpu`` marker, so the ``cuda:1`` tests get
scheduled on every runner including the single-GPU ones.  Previously
``_skip_if_unavailable`` hard-failed via ``pytest.fail`` whenever
``GITHUB_ACTIONS=true`` and the requested device was missing, on the
theory that this would catch a misconfigured multi-GPU runner.  In
practice it just broke the standard CI: the dedicated
``test-fabric-multi-gpu`` workflow already pre-flights
``torch.cuda.device_count() >= 2`` before invoking pytest, so a
genuinely misconfigured multi-GPU runner is already caught there.

Always skip rather than fail when the requested ``cuda:N`` index isn't
available.  Drop the now-unused ``import os``.
Kit's CLI parser reads sys.argv directly at startup and segfaults on
pytest flags that collide with its own short options.  Running

    pytest -m multi_gpu source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py

crashes during collection because Kit sees ``-m multi_gpu`` and exits
with ``Ill formed parameter: -m`` followed by SIGSEGV (exit code 245)
inside ``simulation_app._start_app``.

Strip sys.argv to argv[0] before instantiating AppLauncher.  The test
file takes no CLI arguments of its own, mirroring the broader pattern
used by ``test_tiled_camera_env.py`` which assigns
``sys.argv[1:] = args_cli.unittest_args`` after argparse.
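
The fix amounts to a one-line strip between the import and the launch. A sketch of the pattern (AppLauncher belongs to Isaac Lab and is only referenced in comments here; the helper name is hypothetical):

```python
def strip_pytest_argv(argv: list) -> list:
    """Return an argv containing only the program name, dropping pytest
    flags (e.g. `-m multi_gpu`) that Kit's CLI parser would choke on."""
    return argv[:1]

# Applied in the test file, between the AppLauncher import and its
# instantiation -- any reordering re-exposes Kit to pytest argv:
#     sys.argv = strip_pytest_argv(sys.argv)
#     app_launcher = AppLauncher(headless=True)  # now safe from Kit's parser
```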
wp.to_torch on a ProxyArray is deprecated in favor of the .torch
accessor.  Switch the three call sites that consume the ProxyArray
returned by get_world_poses; leave get_scales call sites alone since
that method still returns a raw wp.array (no .torch accessor).
- Add a GPU-count pre-flight step to the test-fabric-multi-gpu CI job
  so a runner regression to a single GPU fails the workflow instead of
  silently skipping every cuda:1 test. This is what the comment in
  _skip_if_unavailable already promised existed.
- Note that the sys.argv strip in test_views_xform_prim_fabric.py must
  stay between the AppLauncher import and its instantiation; any CLI
  parser or reordering re-exposes Kit to pytest argv and segfaults at
  startup.
- Document the _fabric_usd_sync_done side effect on
  _compose_fabric_transform so callers can see why subsequent getters
  stop pulling from USD.
The class docstring and __init__ device-param doc still claimed
``cuda:0`` only.  Refresh both to note that Fabric acceleration runs on
any CUDA index, so the autodoc API page reflects the actual contract.
Move the test-fabric-multi-gpu job out of test-multi-gpu.yaml and into
a dedicated test-fabric-multi-gpu.yaml.  The two workflows share the
same runner label, install step, and GPU pre-flight, but trigger on
disjoint path sets so changes to FabricFrameView no longer gate the
distributed-training validation and vice versa.

test-multi-gpu.yaml is now byte-identical to upstream/develop.
@pv-nvidia pv-nvidia force-pushed the feat/frame-view-enable-mgpu branch from c1254cd to 1c2e02d Compare May 13, 2026 19:58
